Drug addiction is commonly referred to as an illness that doesn’t discriminate which is why i’m interested to see how geographic distribution (sometimes characterized by income, racial, and demographic strata) would aid in visualizing incidences of drug use. Admittedly, three of the data sets I will be using from the NSDUH database only estimate incidences of drug use which does not necessarily imply a dependence to substances, but I am also interested in mapping out the geographic distribution of drug use estimates in order to later compare it to drug arrest rates across different county locations. I will be focusing on heroin, cocaine, and marijuana use in the state of New York from the years of 2014 to 2016. The reason why I am not using a more recent dataset is because the NSDUH has only released datasets that estimate drug use starting from the age of 12, which would hamper the comparison with the New York arrest data since the latter describes individuals who are 18 or older.
## # A tibble: 3 Ă— 7
## outcome age_group years geography estimate ci_lower ci_upper
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Cocaine Use in the Past … 12 or Ol… 2014… West 0.02227… 0.02052… 0.02416…
## 2 Cocaine Use in the Past … 12 to 17 2014… West 0.00891… 0.00704… 0.01126…
## 3 Cocaine Use in the Past … 18 or Ol… 2014… West 0.02366… 0.02176… 0.02571…
## # A tibble: 3 Ă— 7
## outcome age_group years geography estimate ci_lower ci_upper
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Heroin Use in the Past Y… 12 or Ol… 2014… West 0.00297… 0.00235… 0.00375…
## 2 Heroin Use in the Past Y… 12 to 17 2014… West 0.00089… 0.00053… 0.00149…
## 3 Heroin Use in the Past Y… 18 or Ol… 2014… West 0.00319… 0.00252… 0.00404…
## # A tibble: 3 Ă— 7
## outcome age_group years geography estimate ci_lower ci_upper
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Marijuana Use in the Pas… 12 or Ol… 2014… West 0.16191… 0.15640… 0.16757…
## 2 Marijuana Use in the Pas… 12 to 17 2014… West 0.14199… 0.13453… 0.14981…
## 3 Marijuana Use in the Pas… 18 or Ol… 2014… West 0.16398… 0.15803… 0.17012…
I started off by loading all four data sets. All variables seem to be
identical which is great since I can create a function that cleans all
of my data sets by sub-setting the geography variable to
only include regions with the word “New York” in them, using
str_detect(). I am also filtering out entries in the
age_group variable I am not interested in
(12 or older, and 12 to 17). I have also
manually checked that the function cleans each data set correctly. The
age_group entry for 18 or Older would be
plentiful for my analysis but I will also include the
18 to 25 and 26 or Older entries as it might
be interesting to visualize the nuances of drug use among these two
different age groups.
## # A tibble: 11 Ă— 7
## outcome age_group years geography estimate ci_lower ci_upper
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.11629… 0.09561… 0.14076…
## 2 Marijuana Use in the Pa… 18 to 25 2014… New York… 0.26630… 0.21776… 0.32120…
## 3 Marijuana Use in the Pa… 26 or Ol… 2014… New York… 0.08589… 0.06676… 0.10984…
## 4 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.23101… 0.20122… 0.26375…
## 5 Marijuana Use in the Pa… 18 to 25 2014… New York… 0.48161… 0.42278… 0.54095…
## 6 Marijuana Use in the Pa… 26 or Ol… 2014… New York… 0.19409… 0.16325… 0.22917…
## 7 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.17636… 0.15326… 0.20213…
## 8 Marijuana Use in the Pa… 18 to 25 2014… New York… 0.34904… 0.30077… 0.40063…
## 9 Marijuana Use in the Pa… 26 or Ol… 2014… New York… 0.14731… 0.12379… 0.17441…
## 10 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.12289… 0.09715… 0.15427…
## 11 Marijuana Use in the Pa… 18 to 25 2014… New York… suppres… suppres… suppres…
Some of the entries for specific age groups in each of the drug use datasets have been suppressed by the NSDUH (as an example I included the first 11 rows of the marijuana dataset, the last row has missing entries for estimates), therefore I may not be able to visualize trends across age groups as specifically as I had hoped.
The NSDUH provides a manual for their substate region definitions and
their respective counties, so I have created a dataframe for each of the
11 regions with a geography variable that I will later be
able to join with the datasets I just cleaned, as well as the
corresponding counties.
In order to retrieve the geometric data to create a map of New York
State, I used the “tigris” package in order to save that information
under a dataframe called new_york_county. This will be
useful later on for creating my map.
## Simple feature collection with 2 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.98328 ymin: 40.78536 xmax: -73.48271 ymax: 41.36654
## Geodetic CRS: NAD83
## STATEFP COUNTYFP COUNTYNS GEOIDFQ GEOID NAME NAMELSAD
## 1 36 005 00974101 0500000US36005 36005 Bronx Bronx County
## 2 36 119 00974157 0500000US36119 36119 Westchester Westchester County
## STUSPS STATE_NAME LSAD ALAND AWATER geometry
## 1 NY New York 06 109235672 39353304 MULTIPOLYGON (((-73.77242 4...
## 2 NY New York 06 1115811940 179391718 MULTIPOLYGON (((-73.77237 4...
I then join the dataframe to my combined_regions dataframe under a
new dataframe named new_york_county_regions which will
include the county geometry entries so I can map them, as well as the
previously mentioned geography variable which will allow me
to join the NSDUH data for regional estimates of drug use.
## Simple feature collection with 2 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.98328 ymin: 40.78536 xmax: -73.48271 ymax: 41.36654
## Geodetic CRS: NAD83
## STATEFP COUNTYFP COUNTYNS GEOIDFQ GEOID NAME NAMELSAD
## 1 36 005 00974101 0500000US36005 36005 bronx Bronx County
## 2 36 119 00974157 0500000US36119 36119 westchester Westchester County
## STUSPS STATE_NAME LSAD ALAND AWATER geography
## 1 NY New York 06 109235672 39353304 New York Region 2A: Bronx
## 2 NY New York 06 1115811940 179391718 New York Region 3: Mid-Hudson
## geometry
## 1 MULTIPOLYGON (((-73.77242 4...
## 2 MULTIPOLYGON (((-73.77237 4...
As a form of initial visualization, I plotted each drug use dataset
according to age group and there are many missing entries for more
specific age groupings (18 to 25 and
26 or Older). The Heroin dataset has missing values across
all age groups, even the 18 or Older group which I had
hoped to use for my final map. I had to disaggregate my datasets
according to each age group since I had trouble with the
join functions when those variables weren’t separated. I
was hoping to use the facet_wrap() function for brevity but
had to instead create maps for each of the age groups. As a brief
observation, the drug use percentages appear to be higher in proportion
for the 18 to 25 subgroup which is interesting to note.
However, I cannot firmly come to that conclusion due to the numerous
missing region entries for that subgroup.
This next part is related to the dataset from the New york State
Department’s arrest data for each of New york’s counties from the years
of 2013 up to 2022. Because the document is in excel format, I will skip
the first three rows when reading the file since those cells are empty
and weren’t formatted correctly. I then filter the data to exclude
non-drug arrests and create a new variable called category
which distinguishes between felony and misdemeanor arrests since those
variables were incorrectly formatted when reading the excel sheet.
Here are plots of total drug arrests and a plot that distinguishes between felony and misdemeanor arrests. Counties such as Kings, Queens, Suffolk, and Erie stand out signicantly in terms of high arrest numbers.
I first plotted the cocaine use dataset using the leaflet package
which allowed me to also plot the county drug arrests on top (red
circles) for better visualization. I was having some trouble with
plotting the arrest data since the geometry data wasn’t recognized as an
sf so after trial and error, I did some research and found
that the st_centroid() function from the sf
package would allow me to plot the data as I had hoped. The arrest data
was incredibly skewed with very large arrest numbers in counties such as
Erie, Bronx, and counties mostly in regions 1 and 2 so I also scaled the
diameter of the circles by taking the square root of the total arrests
and dividing that by 10.
Before comparing the two data with one another, it is important to note that there are many limitations. First, I am only plotting cases of specific drug use against much less specific drug arrest data. Further, some drug arrests may be related to an intent to distribute which doesn’t always coincide with personal drug use. Finally, this visual may be a bit misleading since it doesn’t take into account population size, metropolitan areas (albeit some can be intuitively identified for well-known counties in New York City). It also compares regional estimates for counties lumped together with arrests that are specific to counties so there are likely some lost aspects between specific counties and their drug trends.
With that said, regions 2B(Kings), 8, 9, and 5 show the highest estimates of cocaine use for individuals aged 18 or older (shown on the map in light purple). The arrest circles seem to be significantly larger In Erie county (region 11, outer left), Queens, Richmond, Kings (region 2) and some areas in some areas in region 1 (bottom right). This is interesting to see given that those regions show the lowest percentages of cocaine use, relative to other New York regions but perhaps dense populations (especially in region 2) may account for that disproportional trend.
As for marijuana use, regions 4, followed by region 9, and region 2B (Kings) now show the highest estimates of marijuana use among those 18 or older. The plot seems to be lighter in color overall (indicating higher use rates) with the exception of regions 3 and 7 which have lower percentages relative to the rest of the regions. The lighter regions just mentioned seem to have medium-sized arrest circles, however, Queens which also has a relatively high marijuana use rate seems to also have a very large arrest rate, relative to those two regions that show similar percentages. Again, this may be attributed to the area’s dense population.
This last plot is obviously missing two regions (regions 5 and 6) since the data was repressed by the NSDUH but we can still visualize disproportinate trends of heroin use in the remaining regions. Region 8 now shows the highest percentage of heroin use (center, pale yellow), followed by region 11 (left, light orange), region 2B (Kings, light orange), and then regions 2C (New York county), 4 and 7 (pink). This is the first instance in which high rates of drug use (as previously noted in Erie county, region 2 and region 1) seem to coincide with large arrest circles. Again, this is not necessarily related but it is interesting to note that region 8 has the highest rate of heroin use but doesn’t show the same large arrest rates, relative to the previously mentioned regions. Once again, this is probably related to population size, and density.